Quantitative Methodology (UPF)
“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.
Source: R for Data Science
Why do we need to clean data?
A df is tidy if it fulfills these requirements (Wickham 2014):
Intra-State War Data (Correlates of War)
# A tibble: 442 × 4
WarName WarType SideA SideB
<chr> <dbl> <chr> <chr>
1 First Caucasus 5 Russia Georgians, Dhagestania, Chechens
2 Sidon-Damascus 6 Sidon Damascus & Aleppo
3 First Two Sicilies 4 Austria -8
4 First Two Sicilies 4 Two Sicilies Liberals
5 Spanish Royalists 4 Spain Royalists
6 Sardinian Revolt 4 Austria -8
7 Sardinian Revolt 4 Sardinia Carbonari
8 Greek Independence 5 Ottoman Empire Greeks
9 Greek Independence 5 -8 United Kingdom
10 Greek Independence 5 -8 France
# … with 432 more rows
More datasets, opening the tidyr:: package.
# A tibble: 18 × 11
religion `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122 109
2 Atheist 12 27 37 52 35 70 73 59
3 Buddhist 27 21 30 34 33 58 62 39
4 Catholic 418 617 732 670 638 1116 949 792
5 Don’t know/r… 15 14 15 11 10 35 21 17
6 Evangelical … 575 869 1064 982 881 1486 949 723
7 Hindu 1 9 7 9 11 34 47 48
8 Historically… 228 244 236 238 197 223 131 81
9 Jehovah's Wi… 20 27 24 24 21 30 15 11
10 Jewish 19 19 25 25 30 95 69 87
11 Mainline Prot 289 495 619 655 651 1107 939 753
12 Mormon 29 40 48 51 56 112 85 49
13 Muslim 6 7 9 10 9 23 16 8
14 Orthodox 13 17 23 32 32 47 38 42
15 Other Christ… 9 7 11 13 13 14 18 14
16 Other Faiths 20 33 40 46 49 63 46 40
17 Other World … 5 2 3 4 2 7 3 4
18 Unaffiliated 217 299 374 365 341 528 407 321
# … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
# abbreviated variable names ¹`$10-20k`, ²`$20-30k`, ³`$30-40k`, ⁴`$40-50k`,
# ⁵`$50-75k`, ⁶`$75-100k`, ⁷`$100-150k`
Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).
Table A
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Table B
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Table C
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Table D
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
We change the rows and columns of the dataframe keeping the same information.
Pivot wider
pivot_wider(df, names_from, values_from)
Pivot longer
pivot_longer(df, cols, names_to, names_to)
Separate
separate(df, cols, names_to, names_to)
Unite
unite(df, cols, names_to, names_to)
When you click the Render button a document will be generated that includes:
When you click the Render button a presentation will be generated that includes both content and the output of embedded code. You can embed code like this:
[1] 2
Quantitative Methodology (UPF)